尽管电子健康记录是生物医学研究的丰富数据来源,但这些系统并未在医疗环境中统一地实施,并且由于医疗保健碎片化和孤立的电子健康记录之间缺乏互操作性,可能缺少大量数据。考虑到缺少数据的案例的删除可能会在随后的分析中引起严重的偏见,因此,一些作者更喜欢采用多重插补策略来恢复缺失的信息。不幸的是,尽管几项文献作品已经通过使用现在可以自由研究的任何不同的多个归档算法记录了有希望的结果,但尚无共识,MI算法效果最好。除了选择MI策略之外,归纳算法及其应用程序设置的选择也至关重要且具有挑战性。在本文中,受鲁宾和范布伦的开创性作品的启发,我们提出了一个方法学框架,可以应用于评估和比较多种多个插补技术,旨在选择用于计算临床研究工作中最有效的推断。我们的框架已被应用于验证和扩展较大的队列,这是我们在先前的文献研究中提出的结果,我们在其中评估了关键患者的描述符和Covid-19的影响在2型糖尿病患者中的影响,其数据为2型糖尿病,其数据为2型糖尿病由国家共同队列合作飞地提供。
translated by 谷歌翻译
Associazione Medici Diabetologi(AMD)收集并管理着全球最大的糖尿病患者记录集合之一,也称为AMD数据库。本文介绍了一个正在进行的项目的初步结果,该项目的重点是人工智能和机器学习技术的应用,以概念化,清洁和分析如此重要且有价值的数据集,目的是提供预测性见解,以更好地支持糖尿病学家的诊断糖尿病学家和治疗选择。
translated by 谷歌翻译
图表学习方法为解决图形所代表的复杂的现实世界问题打开了新的可能性。但是,这些应用程序中使用的许多图包括数百万节点和数十亿个边缘,并且超出了当前方法和软件实现的功能。我们提供葡萄,这是一种用于图形处理和表示学习的软件资源,能够通过使用专业和智能数据结构,算法和快速并行实现来通过大图扩展。与最先进的软件资源相比,葡萄显示出经验空间和时间复杂性的数量级的改善,以及边缘预测和节点标签预测性能的实质和统计学上的显着改善。此外,葡萄提供了来自文献和其他来源的80,000多种图,标准化界面允许直接整合第三方库,61个节点嵌入方法,25个推理模型和3个模块化管道,以允许公平且可重复的方法比较以及用于图形处理和嵌入的库。
translated by 谷歌翻译
Numerous works use word embedding-based metrics to quantify societal biases and stereotypes in texts. Recent studies have found that word embeddings can capture semantic similarity but may be affected by word frequency. In this work we study the effect of frequency when measuring female vs. male gender bias with word embedding-based bias quantification methods. We find that Skip-gram with negative sampling and GloVe tend to detect male bias in high frequency words, while GloVe tends to return female bias in low frequency words. We show these behaviors still exist when words are randomly shuffled. This proves that the frequency-based effect observed in unshuffled corpora stems from properties of the metric rather than from word associations. The effect is spurious and problematic since bias metrics should depend exclusively on word co-occurrences and not individual word frequencies. Finally, we compare these results with the ones obtained with an alternative metric based on Pointwise Mutual Information. We find that this metric does not show a clear dependence on frequency, even though it is slightly skewed towards male bias across all frequencies.
translated by 谷歌翻译
Accurate uncertainty quantification is necessary to enhance the reliability of deep learning models in real-world applications. In the case of regression tasks, prediction intervals (PIs) should be provided along with the deterministic predictions of deep learning models. Such PIs are useful or "high-quality'' as long as they are sufficiently narrow and capture most of the probability density. In this paper, we present a method to learn prediction intervals for regression-based neural networks automatically in addition to the conventional target predictions. In particular, we train two companion neural networks: one that uses one output, the target estimate, and another that uses two outputs, the upper and lower bounds of the corresponding PI. Our main contribution is the design of a loss function for the PI-generation network that takes into account the output of the target-estimation network and has two optimization objectives: minimizing the mean prediction interval width and ensuring the PI integrity using constraints that maximize the prediction interval probability coverage implicitly. Both objectives are balanced within the loss function using a self-adaptive coefficient. Furthermore, we apply a Monte Carlo-based approach that evaluates the model uncertainty in the learned PIs. Experiments using a synthetic dataset, six benchmark datasets, and a real-world crop yield prediction dataset showed that our method was able to maintain a nominal probability coverage and produce narrower PIs without detriment to its target estimation accuracy when compared to those PIs generated by three state-of-the-art neural-network-based methods.
translated by 谷歌翻译
A quantitative assessment of the global importance of an agent in a team is as valuable as gold for strategists, decision-makers, and sports coaches. Yet, retrieving this information is not trivial since in a cooperative task it is hard to isolate the performance of an individual from the one of the whole team. Moreover, it is not always clear the relationship between the role of an agent and his personal attributes. In this work we conceive an application of the Shapley analysis for studying the contribution of both agent policies and attributes, putting them on equal footing. Since the computational complexity is NP-hard and scales exponentially with the number of participants in a transferable utility coalitional game, we resort to exploiting a-priori knowledge about the rules of the game to constrain the relations between the participants over a graph. We hence propose a method to determine a Hierarchical Knowledge Graph of agents' policies and features in a Multi-Agent System. Assuming a simulator of the system is available, the graph structure allows to exploit dynamic programming to assess the importances in a much faster way. We test the proposed approach in a proof-of-case environment deploying both hardcoded policies and policies obtained via Deep Reinforcement Learning. The proposed paradigm is less computationally demanding than trivially computing the Shapley values and provides great insight not only into the importance of an agent in a team but also into the attributes needed to deploy the policy at its best.
translated by 谷歌翻译
In recent years there has been growing attention to interpretable machine learning models which can give explanatory insights on their behavior. Thanks to their interpretability, decision trees have been intensively studied for classification tasks, and due to the remarkable advances in mixed-integer programming (MIP), various approaches have been proposed to formulate the problem of training an Optimal Classification Tree (OCT) as a MIP model. We present a novel mixed-integer quadratic formulation for the OCT problem, which exploits the generalization capabilities of Support Vector Machines for binary classification. Our model, denoted as Margin Optimal Classification Tree (MARGOT), encompasses the use of maximum margin multivariate hyperplanes nested in a binary tree structure. To enhance the interpretability of our approach, we analyse two alternative versions of MARGOT, which include feature selection constraints inducing local sparsity of the hyperplanes. First, MARGOT has been tested on non-linearly separable synthetic datasets in 2-dimensional feature space to provide a graphical representation of the maximum margin approach. Finally, the proposed models have been tested on benchmark datasets from the UCI repository. The MARGOT formulation turns out to be easier to solve than other OCT approaches, and the generated tree better generalizes on new observations. The two interpretable versions are effective in selecting the most relevant features and maintaining good prediction quality.
translated by 谷歌翻译
Hierarchical time series are common in several applied fields. Forecasts are required to be coherent, that is, to satisfy the constraints given by the hierarchy. The most popular technique to enforce coherence is called reconciliation, which adjusts the base forecasts computed for each time series. However, recent works on probabilistic reconciliation present several limitations. In this paper, we propose a new approach based on conditioning to reconcile any type of forecast distribution. We then introduce a new algorithm, called Bottom-Up Importance Sampling, to efficiently sample from the reconciled distribution. It can be used for any base forecast distribution: discrete, continuous, or in the form of samples, providing a major speedup compared to the current methods. Experiments on several temporal hierarchies show a significant improvement over base probabilistic forecasts.
translated by 谷歌翻译
由于存在对抗性攻击,因此在安全至关重要系统中使用神经网络需要安全,可靠的模型。了解任何输入X的最小对抗扰动,或等效地知道X与分类边界的距离,可以评估分类鲁棒性,从而提供可认证的预测。不幸的是,计算此类距离的最新技术在计算上很昂贵,因此不适合在线应用程序。这项工作提出了一个新型的分类器家族,即签名的距离分类器(SDC),从理论的角度来看,它直接输出X与分类边界的确切距离,而不是概率分数(例如SoftMax)。 SDC代表一个强大的设计分类器家庭。为了实际解决SDC的理论要求,提出了一种名为Unitary级别神经网络的新型网络体系结构。实验结果表明,所提出的体系结构近似于签名的距离分类器,因此允许以单个推断为代价对X进行在线认证分类。
translated by 谷歌翻译
联合学习是用于培训分布式,敏感数据的培训模型的流行策略,同时保留了数据隐私。先前的工作确定了毒害数据或模型的联合学习方案的一系列安全威胁。但是,联合学习是一个网络系统,客户与服务器之间的通信对于学习任务绩效起着至关重要的作用。我们强调了沟通如何在联邦学习中引入另一个漏洞表面,并研究网络级对手对训练联合学习模型的影响。我们表明,从精心选择的客户中删除网络流量的攻击者可以大大降低目标人群的模型准确性。此外,我们表明,来自少数客户的协调中毒运动可以扩大降低攻击。最后,我们开发了服务器端防御,通过识别和上采样的客户可能对目标准确性做出积极贡献,从而减轻了攻击的影响。我们在三个数据集上全面评估了我们的攻击和防御,假设具有网络部分可见性的加密通信渠道和攻击者。
translated by 谷歌翻译